Chinese Word Segmentation Using Minimal Linguistic Knowledge
Abstract
This paper presents a primarily data-driven Chinese word segmentation system and its performance on the closed track, using two corpora, at the first international Chinese word segmentation bakeoff. The system consists of a new-word recognizer, a base segmentation algorithm, and procedures for combining single characters and suffixes and for checking segmentation consistency.
Similar resources
Can MDL Improve Unsupervised Chinese Word Segmentation?
It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower...
Chinese Word Segmentation
Chinese word segmentation has been a very important research topic, not only because it is usually the very first step in Chinese text processing, but also because high segmentation accuracy is a prerequisite for high-performance Chinese text processing applications such as Chinese input, speech recognition, machine translation, and language understanding. This paper reviews the development of Chines...
Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
The manual Chinese word segmentation dataset WordSegCHC 1.0, which was built through eight crowdsourcing tasks conducted on the Crowdflower platform, contains manual word segmentation data for 152 Chinese sentences whose lengths range from 20 to 46 characters without punctuation. All the sentences received 200 segmentation responses in their corresponding crowdsourcing tasks, and the numbers of val...
Adversarial Multi-Criteria Learning for Chinese Word Segmentation
Different linguistic perspectives cause many diverse segmentation criteria for Chinese word segmentation (CWS). Most existing methods focus on improving the performance for each single criterion. However, it is interesting to exploit these different criteria and mine their common underlying knowledge. In this paper, we propose adversarial multi-criteria learning for CWS by integrating shared kn...
Text Window Denoising Autoencoder: Building Deep Architecture for Chinese Word Segmentation
Deep learning is the new frontier of machine learning research, which has led to many recent breakthroughs in English natural language processing. However, there are inherent differences between Chinese and English, and little work has been done to apply deep learning techniques to Chinese natural language processing. In this paper, we propose a deep neural network model: text window denoising ...
Journal title:
- Journal of Chinese Language and Computing
Volume 14
Publication date: 2003